Cross-lingual Name Tagging and Linking for 282 Languages
Abstract
The ambitious goal of this work is to develop a cross-lingual name tagging and linking framework for the 282 languages that exist in Wikipedia. Given a document in any of these languages, our framework is able to identify name mentions, assign a coarse-grained or fine-grained type to each mention, and link it to an English Knowledge Base (KB) if it is linkable. We achieve this goal through a series of new KB mining methods: generating “silver-standard” annotations by transferring annotations from English to other languages through cross-lingual links and KB properties, refining annotations through self-training and topic selection, deriving language-specific morphology features from anchor links, and mining word translation pairs from cross-lingual links. Both name tagging and linking results for 282 languages are promising on Wikipedia data and non-Wikipedia data. All the data sets, resources and systems for 282 languages are made publicly available as a new benchmark.
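The first mining step named in the abstract, transferring annotations from English through cross-lingual links, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `project_silver_tags`, the data layout, and the toy German example are all assumptions made for illustration. The sketch assumes each anchor link in a target-language sentence points to a page title, that a cross-lingual link maps that title to an English page, and that the English page already has an entity type.

```python
from typing import Dict, List, Tuple

# (start token index, end token index exclusive, linked page title)
Anchor = Tuple[int, int, str]


def project_silver_tags(
    tokens: List[str],
    anchors: List[Anchor],
    langlinks: Dict[str, str],
    en_types: Dict[str, str],
) -> List[str]:
    """Return BIO tags for `tokens` (hypothetical helper, illustration only).

    An anchor is tagged only when its page has a cross-lingual link to an
    English page whose type is known; all other tokens stay "O".
    """
    tags = ["O"] * len(tokens)
    for start, end, title in anchors:
        en_title = langlinks.get(title)
        etype = en_types.get(en_title) if en_title is not None else None
        if etype is None:
            continue  # no English counterpart or no known type: leave untagged
        tags[start] = f"B-{etype}"
        for i in range(start + 1, end):
            tags[i] = f"I-{etype}"
    return tags


# Toy data, invented for this example.
tokens = ["Angela", "Merkel", "besuchte", "München"]
anchors = [(0, 2, "Angela Merkel"), (3, 4, "München")]
langlinks = {"Angela Merkel": "Angela Merkel", "München": "Munich"}
en_types = {"Angela Merkel": "PER", "Munich": "LOC"}

silver = project_silver_tags(tokens, anchors, langlinks, en_types)
print(list(zip(tokens, silver)))
```

Tags produced this way are noisy and incomplete, since unlinked mentions remain "O"; the abstract names self-training and topic selection as the refinement steps applied afterwards.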
Similar Resources
Name Tagging for Low-resource Incident Languages based on Expectation-driven Learning
In this paper we tackle a challenging name tagging problem in an emergent setting: the tagger needs to be complete within a few hours for a new incident language (IL) using very few resources. Inspired by observing how human annotators attack this challenge, we propose a new expectation-driven learning framework. In this framework we rapidly acquire, categorize, structure and zoom in on ILspecif...
Cross-Lingual Transfer Learning for POS Tagging without Cross-Lingual Resources
Training a POS tagging model with cross-lingual transfer learning usually requires linguistic knowledge and resources about the relation between the source language and the target language. In this paper, we introduce a cross-lingual transfer learning model for POS tagging without ancillary resources such as parallel corpora. The proposed cross-lingual model utilizes a common BLSTM that enables ...
RPI BLENDER TAC-KBP2016 System Description
We used the Stanford CoreNLP toolkit (Manning et al., 2014b) for English name tagging. To extract name mentions from Chinese and Spanish documents, we use bi-directional LSTM (Long Short-Term Memory) networks, which can leverage long-distance features. The inputs to the networks are pretrained word embeddings and randomly generalized character embeddings. Both word embedding and character embeddings...
Cross-lingual Similarity Calculation for Plagiarism Detection and More - Tools and Resources
Agenda
• EC-Joint Research Centre (JRC): Who we are
• Monolingual plagiarism detection (PD) work at the JRC
• Cross-lingual similarity calculation at the JRC
• Named entity (NE) matching across languages
• Linking related news items across languages
• Identifying translations of documents
• JRC's multilingual tools and resources
• Summary
JRC: Who we are
• European Commission (scientific-techni...
Error Analysis of Cross-lingual Tagging and Parsing
We thoroughly analyse the performance of cross-lingual tagger and parser transfer from English into 32 languages. We suggest potential remedies for identified issues and evaluate some of them.